1 R Keeping raw data in tibbles

Tibbles are similar to data frames with a few tricks up their sleave. One is that you can use so called list columns. So instead of having datasets (or LC-MS raw data) in a list or in different variables it can be in one table with associated meta data.

2 Reading raw data into a tibble/table with meta data

Lets take some raw data data and make it into a table. We will do this in the pipe/dplyr because that is much more readable. First we load some packages we will need in this tutorial.

library(MetabolomiQCsR)
library(purrr)
library(tidyr)
library(dplyr)
library(faahKO)
library(magrittr)
library(ggplot2)
library(plotly)
library(knitr)

Now we first find the path to some test data:

files <- file.path(find.package("faahKO"), "cdf/KO") %>% 
         list.files(full.names=TRUE)

files
## [1] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko15.CDF"
## [2] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko16.CDF"
## [3] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko18.CDF"
## [4] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko19.CDF"
## [5] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko21.CDF"
## [6] "C:/Users/JPZS/Documents/R/win-library/3.3/faahKO/cdf/KO/ko22.CDF"

Then we read the raw data with a wrapper that puts it into a table.

raw_tbl <- files %>% xcmsRaw_to_tbl

raw_tbl
## # A tibble: 6 × 4
##       file polarity           raw
##     <fctr>    <chr>        <list>
## 1 ko15.CDF  unknown <S4: xcmsRaw>
## 2 ko16.CDF  unknown <S4: xcmsRaw>
## 3 ko18.CDF  unknown <S4: xcmsRaw>
## 4 ko19.CDF  unknown <S4: xcmsRaw>
## 5 ko21.CDF  unknown <S4: xcmsRaw>
## 6 ko22.CDF  unknown <S4: xcmsRaw>
## # ... with 1 more variables: path <fctr>

A single raw object could be accessed like this:

raw_tbl$raw[[1]]
## An "xcmsRaw" object with 1278 mass spectra
## 
## Time range: 2501.4-4499.8 seconds (41.7-75 minutes)
## Mass range: 200-600 m/z
## Intensity range: 70-1373180 
## 
## MSn data on  0  mass(es)
##  with  0  MSn spectra
## Profile method: bin 
## Profile step: 1 m/z (401 grid points from 200 to 600 m/z)
## 
## Memory usage: 10.7 MB

3 Extracted ion chromatograms

3.1 TIC for a single raw file

To get a TIC we can use get_EICs with m/z range from -Inf to Inf. Lets create a TIC for one of the raw files and plot it with plot_chrom:

TIC_tbl <- raw_tbl$raw[[1]] %>% 
                                get_EICs(range_tbl = data.frame(mz_lower = -Inf,mz_upper = Inf)) %>% 
                                extract2(1) # since the function can work on multiple ranges we get a list. 
                                            # So we take first element.

plot_chrom(TIC_tbl, RT_col = "scan_rt", Intensity_col = "intensity")

3.2 BPI for a single raw file

We can also get the BPI instead.

BPI_tbl <- raw_tbl$raw[[1]] %>% 
                                get_EICs(range_tbl = data.frame(mz_lower = -Inf,mz_upper = Inf), BPI = TRUE) %>% 
                                extract2(1)

plot_chrom(BPI_tbl, RT_col = "scan_rt", Intensity_col = "intensity")

3.3 TIC for a single raw file, filtered for certain masses

If there is a contaminant (or a continuous calibrant) you can exclude it. In this case there is no such thing in the data so for illustrative purposes we can remove a peak from the TIC (see if you can find it!).

TIC_tbl <- raw_tbl$raw[[1]] %>% 
                            get_EICs(range_tbl = data.frame(mz_lower = -Inf, mz_upper = Inf), 
                                     exclude_mz = 279.000, 
                                     exclude_ppm = 1000
                                    ) %>% 
                            extract2(1)

plot_chrom(TIC_tbl, RT_col = "scan_rt", Intensity_col = "intensity")

3.4 EIC for a single raw file

Now lets do an EIC instead of a TIC. We can look at the mass we excluded above. We again take out just a single raw file. First we need to get the m/z range of the EIC slice.

ppm <- 1000 # this is not accurate mass data

EIC_intervals <- data_frame(mz=279.000, 
                            mz_lower = mz-(ppm/1E6)*mz, 
                            mz_upper = mz+(ppm/1E6)*mz
                           ) 

EIC_intervals
## # A tibble: 1 × 3
##      mz mz_lower mz_upper
##   <dbl>    <dbl>    <dbl>
## 1   279  278.721  279.279

Now we get the EIC data and plot it.

EIC_data <- get_EICs(raw_tbl$raw[[1]], EIC_intervals) %>% extract2(1)
    
EIC_data
## # A tibble: 1,278 × 3
##     scan intensity  scan_rt
##    <int>     <dbl>    <dbl>
## 1      1      2598 41.68963
## 2      2      2515 41.71572
## 3      3      2452 41.74180
## 4      4      2525 41.76788
## 5      5      2717 41.79397
## 6      6      2805 41.82005
## 7      7      2745 41.84613
## 8      8      2644 41.87222
## 9      9      2727 41.89830
## 10    10      2972 41.92438
## # ... with 1,268 more rows
p <- plot_chrom(EIC_data, RT_col = "scan_rt", Intensity_col = "intensity")

p

As a side note we can also get interactive plots with plotly:

p  %>% ggplotly %>% layout(margin = list(l = 80, b = 60))

3.5 EICs for multiple raw files

Now lets get many EICs at the same time for all the files. This might be a bit hairy but I will try to explain each step.

First we make a table of ranges:

ppm <- 1000 # this is not accurate mass data. Used to create intervals below.

range_tbls <-   data_frame(mz = c(508.1, 279.0, 577.3)) %>% 
                mutate(mz_lower = mz-((ppm)/1E6)*mz, 
                       mz_upper = mz+((ppm)/1E6)*mz
                       )

range_tbls
## # A tibble: 3 × 3
##      mz mz_lower mz_upper
##   <dbl>    <dbl>    <dbl>
## 1 508.1 507.5919 508.6081
## 2 279.0 278.7210 279.2790
## 3 577.3 576.7227 577.8773

Then we need to make a copy of the table for each of the files. This is so that they can go in the table together.

range_tbls %<>% list %>% 
                rep(nrow(raw_tbl)) %>% 
                data_frame(ranges = .)

range_tbls
## # A tibble: 6 × 1
##             ranges
##             <list>
## 1 <tibble [3 × 3]>
## 2 <tibble [3 × 3]>
## 3 <tibble [3 × 3]>
## 4 <tibble [3 × 3]>
## 5 <tibble [3 × 3]>
## 6 <tibble [3 × 3]>

At this point we are ready to merge the two tables.

range_tbls_and_files <- bind_cols(raw_tbl, range_tbls)

range_tbls_and_files %>% select(-path) # In this display we remove the path column just to better show the relevant data.
## # A tibble: 6 × 4
##       file polarity           raw           ranges
##     <fctr>    <chr>        <list>           <list>
## 1 ko15.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>
## 2 ko16.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>
## 3 ko18.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>
## 4 ko19.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>
## 5 ko21.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>
## 6 ko22.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]>

Now the FUN begins. We use map2 from the purrr to go through each row of the table and run get_EICs on each combination of the raw and ranges columns.

range_tbls_and_files %<>% mutate(EIC = map2(raw,ranges, get_EICs  ))

range_tbls_and_files  %>% select(-path)
## # A tibble: 6 × 5
##       file polarity           raw           ranges        EIC
##     <fctr>    <chr>        <list>           <list>     <list>
## 1 ko15.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
## 2 ko16.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
## 3 ko18.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
## 4 ko19.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
## 5 ko21.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
## 6 ko22.CDF  unknown <S4: xcmsRaw> <tibble [3 × 3]> <list [3]>
# Inside each of the EIC lists we have a table for each EIC slice:
range_tbls_and_files$EIC[[1]]
## [[1]]
## # A tibble: 1,278 × 3
##     scan intensity  scan_rt
##    <int>     <dbl>    <dbl>
## 1      1       620 41.68963
## 2      2       439 41.71572
## 3      3       459 41.74180
## 4      4       446 41.76788
## 5      5       413 41.79397
## 6      6       428 41.82005
## 7      7       559 41.84613
## 8      8       693 41.87222
## 9      9       751 41.89830
## 10    10       684 41.92438
## # ... with 1,268 more rows
## 
## [[2]]
## # A tibble: 1,278 × 3
##     scan intensity  scan_rt
##    <int>     <dbl>    <dbl>
## 1      1      2598 41.68963
## 2      2      2515 41.71572
## 3      3      2452 41.74180
## 4      4      2525 41.76788
## 5      5      2717 41.79397
## 6      6      2805 41.82005
## 7      7      2745 41.84613
## 8      8      2644 41.87222
## 9      9      2727 41.89830
## 10    10      2972 41.92438
## # ... with 1,268 more rows
## 
## [[3]]
## # A tibble: 1,278 × 3
##     scan intensity  scan_rt
##    <int>     <dbl>    <dbl>
## 1      1      1356 41.68963
## 2      2      1473 41.71572
## 3      3       826 41.74180
## 4      4       751 41.76788
## 5      5      1131 41.79397
## 6      6       941 41.82005
## 7      7       918 41.84613
## 8      8       563 41.87222
## 9      9       592 41.89830
## 10    10       605 41.92438
## # ... with 1,268 more rows

So now we need to wrangle the data a bit to be able to unwrap those nested tables. We go into the EIC lists, merge the lists while we add a row to show which m/z slice each belongs to. While we are at it we change the new mz column to a factor.

range_tbls_and_files %<>% mutate(EIC = map2(EIC,ranges, ~ bind_rows(.x %>% setNames(.y$mz), .id = "mz") %>% 
                                                          mutate(mz = mz %>% as.numeric %>% as.factor) 
                                            )
                                 )

range_tbls_and_files$EIC[[1]]
## # A tibble: 3,834 × 4
##        mz  scan intensity  scan_rt
##    <fctr> <int>     <dbl>    <dbl>
## 1   508.1     1       620 41.68963
## 2   508.1     2       439 41.71572
## 3   508.1     3       459 41.74180
## 4   508.1     4       446 41.76788
## 5   508.1     5       413 41.79397
## 6   508.1     6       428 41.82005
## 7   508.1     7       559 41.84613
## 8   508.1     8       693 41.87222
## 9   508.1     9       751 41.89830
## 10  508.1    10       684 41.92438
## # ... with 3,824 more rows

So now we can get rid of those nested tables. First we remove the ranges and the raw columns because we cannot unnest everything together.

range_tbls_and_files %<>% select(-ranges, -raw) %>% unnest

3.5.1 One plot with all file/mz together

Now we can even plot all the EICs for all the files at the same time of we want.

p <- range_tbls_and_files %>% 
                                plot_chrom(RT_col = "scan_rt", Intensity_col = "intensity") +
                                facet_grid(file ~ mz) # since plot_chrom gives a ggplot2 object 
                                                      # we can continue manipulating it.

p

3.5.2 One plot for each file/mz

You can also generate a separate plot for each file/mz combination very easily:

# Nest the table again by file/mz
range_tbls_and_files %<>% group_by(file, mz, path) %>% nest(.key = "EIC")

range_tbls_and_files
## # A tibble: 18 × 4
##        file     mz
##      <fctr> <fctr>
## 1  ko15.CDF  508.1
## 2  ko15.CDF    279
## 3  ko15.CDF  577.3
## 4  ko16.CDF  508.1
## 5  ko16.CDF    279
## 6  ko16.CDF  577.3
## 7  ko18.CDF  508.1
## 8  ko18.CDF    279
## 9  ko18.CDF  577.3
## 10 ko19.CDF  508.1
## 11 ko19.CDF    279
## 12 ko19.CDF  577.3
## 13 ko21.CDF  508.1
## 14 ko21.CDF    279
## 15 ko21.CDF  577.3
## 16 ko22.CDF  508.1
## 17 ko22.CDF    279
## 18 ko22.CDF  577.3
## # ... with 2 more variables: path <fctr>, EIC <list>
# now make the plots
range_tbls_and_files %<>%   mutate(plot = map(EIC, ~ plot_chrom(.x, RT_col = "scan_rt", Intensity_col = "intensity")))
                                
range_tbls_and_files
## # A tibble: 18 × 5
##        file     mz
##      <fctr> <fctr>
## 1  ko15.CDF  508.1
## 2  ko15.CDF    279
## 3  ko15.CDF  577.3
## 4  ko16.CDF  508.1
## 5  ko16.CDF    279
## 6  ko16.CDF  577.3
## 7  ko18.CDF  508.1
## 8  ko18.CDF    279
## 9  ko18.CDF  577.3
## 10 ko19.CDF  508.1
## 11 ko19.CDF    279
## 12 ko19.CDF  577.3
## 13 ko21.CDF  508.1
## 14 ko21.CDF    279
## 15 ko21.CDF  577.3
## 16 ko22.CDF  508.1
## 17 ko22.CDF    279
## 18 ko22.CDF  577.3
## # ... with 3 more variables: path <fctr>, EIC <list>, plot <list>
range_tbls_and_files$plot[[1]]

4 Summarise EIC

This probably only makes sense if you are look for the max of a peak or if you have a contaminant and you want to know the median intensity etc.

EIC_summary <-  range_tbls_and_files %>% 
                select(-plot) %>% # can't unnest the plot and EIC at the same time 
                unnest %>% 
                group_by(file, mz, path) %>% 
                summarise(EIC_median = median(intensity), 
                          EIC_mean   = mean(intensity), 
                          EIC_sd     = sd(intensity), 
                          EIC_max    = max(intensity)
                          )

EIC_summary %>% select(-path) %>% kable # kable is just to show a nice table below instead of the print display
file mz EIC_median EIC_mean EIC_sd EIC_max
ko15.CDF 279 2268.5 13566.8670 72950.8749 805248
ko15.CDF 508.1 1479.0 34870.9906 158436.2794 1298944
ko15.CDF 577.3 1103.5 8531.4108 38677.2575 457024
ko16.CDF 279 1994.0 11381.0313 52718.6966 547904
ko16.CDF 508.1 1731.5 34384.7128 143008.7030 1115648
ko16.CDF 577.3 1002.5 4645.3091 16952.8447 143744
ko18.CDF 279 2241.0 12305.5156 61206.9076 632896
ko18.CDF 508.1 2963.0 31214.2121 125154.1095 1039296
ko18.CDF 577.3 696.5 2240.9390 5490.7539 28512
ko19.CDF 279 1748.5 8695.0156 39067.6209 361536
ko19.CDF 508.1 3014.5 19624.6487 79410.1471 695552
ko19.CDF 577.3 755.0 761.6150 447.4914 3365
ko21.CDF 279 1677.5 10570.1776 49061.6622 483776
ko21.CDF 508.1 1756.0 20184.8232 81555.5366 687680
ko21.CDF 577.3 877.5 982.4225 678.2650 3994
ko22.CDF 279 1701.5 8992.5853 44103.5943 465600
ko22.CDF 508.1 1689.5 17618.2261 72745.8346 627712
ko22.CDF 577.3 893.5 953.6995 603.1006 2528